{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). This means that you are able to copy, share and modify the work, as long as the result is distributed under the same license.**\n",
    "\n",
    "# Concordia Python workshop - module 3 - Intorduction to Biopython   \n",
    "\n",
    "by Mathieu Bourgey, _Ph.D_  \n",
    "\n",
    "This Workshop is an adaptation of some interesting point of the [general Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html)\n",
    "\n",
    "## Learning objectives\n",
    "During this wiorkshop you will learn:  \n",
    "\n",
    " - Manipulate Sequence Objects\n",
    " - Annotate Sequence Objects\n",
    " - Read/write Fasta file\n",
    " - Blasting Sequences\n",
    " - Multiple Alignment Sequences \n",
    "\n",
    "\n",
    "\n",
    "## The Seq Object\n",
    "Biological sequences are arguably the central object in Bioinformatics. Biopython sequences are essentially strings of letters like **AGTACACTGGT** as seen in common biological file formats.\n",
    "\n",
    "The Biopython `Seq` object is defined in the `Bio.Seq` module\n",
    "\n",
    "The `Seq` object is different from traditional python strings:  \n",
    "\n",
    " 1. It supports most of the string methods but it also comes with its specifc set of methods\n",
    "  * translate() _- Turns a nucleotide sequence into a protein sequence._\n",
    "  * reverse_complement() _- Returns the reverse complement sequence._\n",
    "  * complement() _- Returns the complement sequence._\n",
    "  * transcribe() _-Returns the RNA sequence from a DNA sequence._\n",
    "  * back_transcribe() _- Returns the DNA sequence from an RNA sequence._\n",
    "  * ungap() _- Return a copy of the sequence without the gap character(s)._\n",
    " 2. It has an important attribute, the alphabet, which is an object describing the type of the sequence and how the characters should be interpreted. Biopython alphabet are define in the `Bio.Alphabet` module\n",
    "\n",
    " \n",
    "[The detail API of the `Seq` object](http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html)\n",
    "\n",
    "\n",
    "[The detail API of the `Alphabet` object](http://biopython.org/DIST/docs/api/Bio.Alphabet-module.html)\n",
    " \n",
    "### Sequence Manipulation\n",
    "We’ll use the [IUPAC](http://www.chem.qmw.ac.uk/iupac/) alphabets here to deal with some of our favorite objects: DNA, RNA and Proteins.\n",
    "\n",
    "We can create an ambiguous sequence with the default generic alphabet like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.Seq import Seq\n",
    "my_seq = Seq(\"AGTACACTGGT\")\n",
    "my_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('AGTACACTGGT', Alphabet())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_seq.alphabet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Alphabet()\n",
    "\n",
    "In the above example, we haven't specified an alphabet so we end up with\n",
    "a default generic alphabet. Biopython doesn't know if this is a\n",
    "nucleotide sequence or a protein rich in alanines, glycines, cysteines\n",
    "and threonines. If *you* know, you should supply this information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.Alphabet import IUPAC\n",
    "my_seq = Seq(\"AGTACACTGGT\", IUPAC.unambiguous_dna)\n",
    "my_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('AGTACACTGGT', IUPACUnambiguousDNA())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_seq.alphabet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> IUPACUnambiguousDNA()\n",
    "\n",
    "####  Seq as python strings\n",
    "\n",
    "In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for index, letter in enumerate(my_seq):\n",
    "   print(\"%i %s\" % (index, letter))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 0 A   \n",
    "> 1 G   \n",
    "> 2 T   \n",
    "> 3 A   \n",
    "> 4 C   \n",
    "> 5 A   \n",
    "> 6 C   \n",
    "> 7 T   \n",
    "> 8 G   \n",
    "> 9 G   \n",
    "> 10 T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(my_seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 11\n",
    "\n",
    "The Seq object has a `.count()` method, just like a string. Note that this means that like a Python string, this gives a non-overlapping count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_seq.count(\"A\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_seq.count(\"GT\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 2\n",
    "\n",
    "Note that this means that like a Python string, this gives a non-overlapping count:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Seq(\"AAAA\").count(\"AA\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 2\n",
    "\n",
    "####  Slicing a Seq object\n",
    "\n",
    "let’s get a slice of the sequence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_seq[2:8]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('TACACT', IUPACUnambiguousDNA())\n",
    "\n",
    "Note that it follows the normal conventions for Python strings. \n",
    " - So the first element of the sequence is 0 (which is normal for computer science, but not so normal for biology).\n",
    " - The first item is included i.e. 2 in this case (3rd aa)\n",
    " - The last is excluded i.e 8 in this case (9th aa)\n",
    "\n",
    "###### Exercice\n",
    "\n",
    "**In one command could you extract the third codon positions of this DNA sequence ?**\n",
    "\n",
    "[solution](solutions/_codonExtract.ipynb)\n",
    "\n",
    "#### Concatenate sequence\n",
    "You can in principle add any two Seq objects together just like you can with Python strings. But `Seq` object are made for biological data so you the concatenation method only accept to merge sequences with compatible alphabets. You are allowed to concatenate a protein sequence and a DNA sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_seq = Seq(\"EVRNAK\", IUPAC.protein)\n",
    "d_seq = Seq('TACACT', IUPAC.unambiguous_dna)\n",
    "d_seq + my_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('TACACTAGTACACTGGT', IUPACUnambiguousDNA())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_seq + my_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Traceback (most recent call last):   \n",
    "> ...   \n",
    "> TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()\n",
    "\n",
    "**If you __realy__ want to do that, how should you do ?** [solution](solutions/_concate.ipynb)\n",
    "\n",
    "#### Seq as Biological strings\n",
    "\n",
    "The `Seq` object is more than a python string with a specific alphabet, it also offers methods specific to facilitate the biology oriented analysis.\n",
    "\n",
    "#### Complement and reverse complement\n",
    "DNA is double stranded but in most case sequence are represented as single stranded molecules. for many purpose , i.e alignment, we need to compare a query sequence to reference sequence. In this case, we need to know if the reference sequence contains the query sequence in one or the other strands. \n",
    "\n",
    "You can easily obtain the complement or reverse complement of a Seq object using its built-in methods:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_seq.complement()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('TCATGTGACCA', IUPACUnambiguousDNA())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_seq.reverse_complement()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('ACCAGTGTACT', IUPACUnambiguousDNA())\n",
    "\n",
    "\n",
    "There is no specific methods in Biopython to only reverse your sequence\n",
    "\n",
    "\n",
    "**Do you know why and how to proceed ?**   \n",
    "[solution](solutions/_reverse.ipynb)\n",
    "\n",
    "\n",
    "Note that these methods only work for dna alphabet. Trying to (reverse)complement a protein sequence will raise you an error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_seq = Seq(\"EVRNAK\", IUPAC.protein)\n",
    "p_seq.reverse_complement()\n",
    "\n"
    ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Traceback (most recent call last):__ \n",
    "> ...   \n",
    "> ValueError: Proteins do not have complements!\n",
    "\n",
    "#### Transcirption, reverse transcription and translation \n",
    "First we need to clarify the transcription strand issue !   \n",
    "\n",
    "![Transcription strand](img/Transcription_strand.png)\n",
    "\n",
    "Biologically the transcription do a reverse complement of the template strand while inserting Uracile instead of Thymine (TCAG → CUGA) to give the RNA.\n",
    "\n",
    "However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T → U\n",
    "\n",
    "Let's do a simple transcription of our sequence:\n",
    "\n"
    ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_seq=my_seq.transcribe()\n",
    "r_seq\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('AGUACACUGGU', IUPACUnambiguousRNA())\n",
    "\n",
    "And a reverse transcription of the resulting sequence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_seq.back_transcribe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('AGTACACTGGT', IUPACUnambiguousDNA())\n",
    "\n",
    "As you can see, all this does is switch T -> U or U -> T and adjust the alphabet.\n",
    "\n",
    "\n",
    "###### Exercice\n",
    "\n",
    "**Could you generate the mRNA from this template strand sequence: __3'-TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC-5'__ ?**  \n",
    "\n",
    "[solution](solutions/_mRNA.ipynb)\n",
    "\n",
    "\n",
    "Now let’s translate this mRNA into the corresponding protein sequence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_seq = r_seq.translate()\n",
    "p_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))\n",
    "\n",
    "Note that Biopython also allow to directly translate from DNA sequence. In this case use the coding strand DNA sequence.\n",
    "\n",
    "You should also notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This is due to the use of the wrong translation table in this case.\n",
    "\n",
    "The translation tables available in Biopython are based on those from the [NCBI](http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). By default, translation will use the standard genetic code (NCBI table id 1). \n",
    "\n",
    "In our case, we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_seq = r_seq.translate(table=\"Vertebrate Mitochondrial\")\n",
    "p_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))\n",
    "\n",
    "Also you may want to translate the nucleotides up to the first in frame stop codon, and then stop. In this case you should use to_stop method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_seq.translate(to_stop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('MAIVMGR', IUPACProtein())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_seq.translate(table=2, to_stop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Seq('MAIVMGRWKGAR', IUPACProtein())\n",
    "\n",
    "Notice that when you use the to_stop argument, the stop codon itself is not translated, Notice also that you can specify the table using the NCBI table number which is shorter.\n",
    "\n",
    "## The SeqRecord object\n",
    "Immediately “above” the `Seq` class is the Sequence Record or `SeqRecord` class, defined in the `Bio.SeqRecord` module. This class allows higher level features to be associated with the sequence.\n",
    "\n",
    "[The detail API of the `SeqRecord` object](http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html)\n",
    "\n",
    "A SeqRecord object holds a sequence and information about it.\n",
    "\n",
    "**Main attributes:\n**",
    " - .id - Identifier such as a locus tag or an accesion number (string)\n",
    " - .seq - The sequence itself (Seq object or similar)\n\n",
    "**Additional attributes:**",
    " - .name - Sequence name, e.g. gene name (string)\n",
    " - .description - A human readable description or expressive name for the sequence (string)\n",
    " - .letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds Python sequences (lists, strings or tuples) whose length matches that of the sequence. A typical use would be to hold a list of integers representing sequencing quality scores, or a string representing the secondary structure.\n",
    " - .features - Any (sub)features defined (list of `SeqFeature` objects), i.e location, type or strand...\n",
    " - .annotations - Further information about the whole sequence (dictionary). Most entries are strings, or lists of strings. This allows the addition of more “unstructured” information to the sequence.\n",
    " - .dbxrefs - List of database cross references (list of strings)",
    "Using a `SeqRecord` object is not very complicated, since all of the information is presented as attributes of the class. Usually you won’t create a `SeqRecord` “by hand”, but instead you a specific Class to read in a sequence file for you (presented in the next section). However, creating `SeqRecord` can be quite simple.\n",
    "\n",
    "\n",
    "### Manually creating a seqRecord\n",
    "\n",
    "To create a `SeqRecord` at a minimum you just need a `Seq` object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.SeqRecord import SeqRecord\n",
    "simple_seq_r = SeqRecord(my_seq)\n",
    "simple_seq_r\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SeqRecord(seq=Seq('AGTACACTGGT', IUPACUnambiguousDNA()), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])\n",
    "\n",
    "We can also manually pass the id, name and description to the object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simple_seq_r.id = \"THX1138\"\n",
    "simple_seq_r.name = \"THX 1138 4EB\"\n",
    "simple_seq_r.description = \"Made up sequence I wish I could write a paper about\"\n",
    "simple_seq_r\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SeqRecord(seq=Seq('AGTACACTGGT', IUPACUnambiguousDNA()), id='THX1138', name='THX 1138 4EB', description='Made up sequence I wish I could write a paper about', dbxrefs=[])\n",
    "\n",
    "Including an identifier is very important if you want to output your SeqRecord to a file.\n",
    "\n",
    "The `SeqRecord` has an dictionary attribute annotations. This is used for any miscellaneous annotations that doesn’t fit under one of the other more specific attributes. Adding annotations is easy, and just involves dealing directly with the annotation dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simple_seq_r.annotations[\"evidence\"] = \"None. I just made it up.\"\n",
    "simple_seq_r.annotations\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> {'evidence': 'None. I just made it up.'}\n",
    "\n",
    "Working with per-letter-annotations is similar, letter_annotations is a dictionary like attribute which will let you assign any Python sequence (i.e. a string, list or tuple) which has the same length as the sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "simple_seq_r.letter_annotations[\"phred_quality\"] =  random.sample(xrange(1, 50),len(simple_seq_r))\n",
    "simple_seq_r.letter_annotations\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> {'phred_quality': [22, 23, 3, 2, 29, 11, 34, 44, 5, 33, 16]}\n",
    "\n",
    "The `format()` method of the `SeqRecord` class gives a string containing your record formatted using one of the output file formats supported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simple_seq_r.format('fasta')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> '>THX1138 Made up sequence I wish I could write a paper about\\nAGTACACTGGT\\n'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simple_seq_r.format('fastq')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> '@THX1138 Made up sequence I wish I could write a paper about\\nAGTACACTGGT\\n+\\n78$#>,CM&B1\\n'\n",
    "\n",
    "## The SeqIO Class\n",
    "The `SeqIO` Class provide a simple interface for working with assorted sequence file formats in a uniform way.\n",
    "\n",
    "The “catch” is that you have to work with `SeqRecord` objects, which contain a `Seq` object with format specific annotation like an identifier and description.\n",
    "\n",
    "\n",
    "[The detail API of the `SeqIO` module](http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html)\n",
    "\n",
    "### Reading or parsing sequence files \n",
    "\n",
    "The main method of the `SeqIO` module is __.parse()__  is used to read in sequence data as `SeqRecord` objects. This function expects two arguments:\n",
    "1. A handle to read the data from, or a filename.\n",
    "2. A lower case string specifying sequence format. See http://biopython.org/wiki/SeqIO for a full listing of supported formats.\n",
    "There is an optional argument alphabet to specify the alphabet to be used. This is useful for file formats like FASTA where otherwise Bio.SeqIO will default to a generic alphabet.\n",
    "\n",
    "The __.parse()__ methods is typically used with a for loop like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import SeqIO\n",
    "for seq_record in SeqIO.parse(\"data/NC_000913.fna\",\"fasta\"):\n",
    "    print(seq_record.id)\n",
    "    print(repr(seq_record.seq))\n",
    "    print(len(seq_record))\n",
    "    print(len(seq_record.features))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> gi|556503834|ref|NC_000913.3|   \n",
    "> Seq('AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAG...TTC', SingleLetterAlphabet())   \n",
    "> 4641652   \n",
    "> 0\n",
    "\n",
    "\n",
    "If instead you wanted to load a GenBank format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for seq_record in SeqIO.parse(\"data/NC_000913.gbk\",\"genbank\"):\n",
    "    print(seq_record.id)\n",
    "    print(repr(seq_record.seq))\n",
    "    print(len(seq_record))\n",
    "    print(len(seq_record.features))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> gi|556503834|ref|NC_000913.3|   \n",
    "> Seq('AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAG...TTC', SingleLetterAlphabet())   \n",
    "> 4641652   \n",
    "> 22996 \n",
    "\n",
    "\n",
    "Notice that in the 2 examples the `SeqRecord` object contain the same central information and that mostly the difference is in the amount of information provided in the annotations (__.features__) \n",
    "\n",
    "Similarly, if you wanted to read in a file in another file format, you would just need to change the format string as appropriate. For example “swiss” for SwissProt files or “embl” for EMBL text files. There is a full listing on the [Biopython wiki page](http://biopython.org/wiki/SeqIO)\n",
    "\n",
    "Another very common way to use a Python iterator is within a list comprehension"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "identifiers = [seq_record.id for seq_record in SeqIO.parse(\"data/patato_pep.fasta\",\"fasta\")]\n",
    "identifiers\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ['PGSC0003DMP400067339', 'PGSC0003DMP400027454', 'PGSC0003DMP400020381', 'PGSC0003DMP400022612', 'PGSC0003DMP400040011', 'PGSC0003DMP400032361', 'PGSC0003DMP400030628', 'PGSC0003DMP400028584', 'PGSC0003DMP400060824', 'PGSC0003DMP400037883']\n",
    "\n",
    "We usually use a for loop to iterate over all the records one by one. The object returned by `SeqIO` is actually an iterator which returns `SeqRecord` objects. You get to access each record in turn, but once and only once. The plus point is that an iterator can save you memory when dealing with large files.\n",
    "\n",
    "Sometime you need to be able to access the records in any order. In that case you could acces the records using a list or dictionary process. But be carefull because you could easily carsh or slowdown python when using this methods on large data sets.\n",
    "\n",
    "#### The list approach\n",
    "We can turn the record iterator into a list of SeqRecord objects using the built-in Python function __list()__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "records = list(SeqIO.parse(\"data/patato_pep.fasta\",\"fasta\"))\n",
    "records[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SeqRecord(seq=Seq('MLPFESIEEASMSLGRNLTFGETLWFNYTADKSDFYLYCHNTIFLIIFYSLVPL...SE*', SingleLetterAlphabet()), id='PGSC0003DMP400022612', name='PGSC0003DMP400022612', description='PGSC0003DMP400022612 PGSC0003DMT400033236 Protein', dbxrefs=[])\n",
    "\n",
    "#### The dictionary approach\n",
    "`SeqIO` module has three related functions which allow dictionary like random access to a multi-sequence file. There is a trade off here between flexibility and memory usage. In summary:  \n",
    "\n",
    "   - `.to_dict()` is the most flexible but also the most memory demanding option. This is basically a helper function to build a normal Python dictionary with each entry held as a SeqRecord object in memory, allowing you to modify the records.\n",
    "   -  `.index()` is a useful middle ground, acting like a read only dictionary and parsing sequences into SeqRecord objects on demand.\n",
    "   - `.index_db()` also acts like a read only dictionary but stores the identifiers and file offsets in a file on disk (as an SQLite3 database), meaning it has very low memory requirements, but will be a little bit slower. \n",
    "\n",
    "\n",
    "Let's test the different functions\n",
    "\n",
    "You can use the function `.to_dict()` to make a SeqRecord dictionary (in memory). By default this will use each record’s identifier (i.e. the .id attribute) as the key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "records_dict = SeqIO.to_dict(SeqIO.parse(\"data/patato_pep.fasta\",\"fasta\"))\n",
    "records_dict.keys()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ['PGSC0003DMP400030628', 'PGSC0003DMP400032361', 'PGSC0003DMP400027454', 'PGSC0003DMP400060824', 'PGSC0003DMP400040011', 'PGSC0003DMP400037883', 'PGSC0003DMP400022612', 'PGSC0003DMP400020381', 'PGSC0003DMP400067339', 'PGSC0003DMP400028584']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "records_dict['PGSC0003DMP400020381']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SeqRecord(seq=Seq('MLEKDSRDDRLDCVFPSKHDKDSVEEVSSLSSENTRTSNDCSRSNNVDSISSEV...KY*', SingleLetterAlphabet()), id='PGSC0003DMP400020381', name='PGSC0003DMP400020381', description='PGSC0003DMP400020381 PGSC0003DMT400029984 Protein', dbxrefs=[])\n",
    "\n",
    "`.to_dict()` is very flexible because it holds everything in memory. The size of file you can work with is limited by your computer’s RAM. In general, this will only work on small to medium files.\n",
    "\n",
    "---------------\n",
    "\n",
    "For larger files you should consider `.index()`, which works a little differently. Although it still returns a dictionary like object, this does not keep everything in memory. Instead, it just records where each record is within the file and when you ask for a particular record, it then parses it on demand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "records_dict = SeqIO.index(\"data/patato_pep.fasta\",\"fasta\")\n",
    "list(records_dict.keys())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ['PGSC0003DMP400030628', 'PGSC0003DMP400032361', 'PGSC0003DMP400027454', 'PGSC0003DMP400060824', 'PGSC0003DMP400040011', 'PGSC0003DMP400037883', 'PGSC0003DMP400022612', 'PGSC0003DMP400020381', 'PGSC0003DMP400067339', 'PGSC0003DMP400028584']\n",
    "\n",
    "Note that in this case the `.keys()` function return an iterator and we need to use the list function to get the key values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "records_dict['PGSC0003DMP400020381']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SeqRecord(seq=Seq('MLEKDSRDDRLDCVFPSKHDKDSVEEVSSLSSENTRTSNDCSRSNNVDSISSEV...KY*', SingleLetterAlphabet()), id='PGSC0003DMP400020381', name='PGSC0003DMP400020381', description='PGSC0003DMP400020381 PGSC0003DMT400029984 Protein', dbxrefs=[])\n",
    "\n",
    "Note that `.index()` won’t take a handle, but only a filename. \n",
    "\n",
    "-----------------\n",
    "\n",
    "`.index_db()` work on even extremely large files since it stores the record information as a file on disk (using an SQLite3 database) rather than in memory. Also, we can index multiple files together (providing all the record identifiers are unique).\n",
    "\n",
    "`.index_db()` function takes three required arguments:\n",
    "- Index filename\n",
    "- List of sequence filenames to index (or a single filename)\n",
    "- File format (lower case string as used in the rest of the SeqIO module). \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "patato_pep = SeqIO.index_db(\"patato_pep.idx\", \"data/patato_pep.fasta\",\"fasta\")\n",
    "patato_pep.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ['PGSC0003DMP400020381', 'PGSC0003DMP400022612', 'PGSC0003DMP400027454', 'PGSC0003DMP400028584', 'PGSC0003DMP400030628', 'PGSC0003DMP400032361', 'PGSC0003DMP400037883', 'PGSC0003DMP400040011', 'PGSC0003DMP400060824', 'PGSC0003DMP400067339']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "patato_pep['PGSC0003DMP400040011']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SeqRecord(seq=Seq('MECDTEDSEDNSNIQADSNHRLVKFVIPGNNLLDQTKSSSTKVVLIFLESVEIL...NF*', SingleLetterAlphabet()), id='PGSC0003DMP400040011', name='PGSC0003DMP400040011', description='PGSC0003DMP400040011 PGSC0003DMT400059441 Protein', dbxrefs=[])\n",
    "\n",
    "\n",
    "**So, which of these methods should you use and why ?** [solution](solutions/_seqIO1.ipynb) \n",
    "\n",
    "### Writing sequence files\n",
    "\n",
    "The `.write()` is used to output sequences (writing files). This function taking three arguments: \n",
    "\n",
    " - Some SeqRecord objects. \n",
    " - A handle or filename to write to. \n",
    " - A sequence format.\n",
    "\n",
    "First, let's write a sequence into the file __testOut.fa__ :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "SeqIO.write(simple_seq_r, \"testOut.fa\",  \"fasta\")\n",
    "os.system(\"cat testOut.fa\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> \\>THX1138 Made up sequence I wish I could write a paper about   \n",
    "> AGTACACTGGT\n",
    "\n",
    "then, let's write a set of patato sequences into the file __testOut.fa__ :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for seq_record in SeqIO.parse(\"data/patato_pep.fasta\",\"fasta\") : \n",
    "  SeqIO.write(seq_record, \"testOut.fa\",  \"fasta\")\n",
    "\n",
    "os.system(\"cat testOut.fa\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> \\>PGSC0003DMP400037883 PGSC0003DMT400056292 Protein   \n",
    "> MMIGRDPEIWENPEEFIPERFLNSDIDYFKGQNFELIPFGAGRRGCPGIALGVATINLIL   \n",
    "> SNLLYAFDWELPHGMIKEDIDTDGLPGLAMHKKNALCLVPKNYTHT*\n",
    "\n",
    "\n",
    "**Do you notice something strange in testOut.fa ? and explain why ?** [solution](solutions/_seqIO2.ipynb)\n",
    "\n",
    "\n",
    "###### Exercice\n",
    "\n",
    "**Could you write the content data/patato_pep.fasta into testOut.fa ?** [solution](solutions/_seqIO3.ipynb)\n",
    "\n",
    "\n",
    "## The Blast Class\n",
    "Everybody loves BLAST ! How can it get any easier to do comparisons between one of your sequences and every other sequence in the known world?\n",
    "\n",
    "If you don't know Blast please explore http://blast.ncbi.nlm.nih.gov/Blast.cgi\n",
    "\n",
    "Dealing with BLAST can be split up into two steps, both of which can be done from within Biopython:\n",
    "\n",
    " 1. Running BLAST for your query sequence(s), and getting some output. \n",
    " 2. Parsing the BLAST output in Python for further analysis.\n",
    "\n",
    "To do that Biopython provide the specific `Blast` module.\n",
    "\n",
    "[The detail API of the `Blast` module](http://biopython.org/DIST/docs/api/Bio.Blast-module.html)\n",
    " \n",
    "There are lots of ways you can run BLAST, especially you can run BLAST locally (on your own machine), or run BLAST remotely (on another machine, typically the NCBI servers).\n",
    "\n",
    "In this workshop we will only focus on the remote way to run blast.\n",
    "\n",
    "\n",
    "### Running BLAST on NCBI servers\n",
    "The specific sub-module `Blast.NCBIWWW` allows to call the online version of BLAST through is main function `qblast()`.\n",
    "\n",
    "This function has three mandatory arguments:\n",
    "\n",
    " 1. The blast program to use for the search, as a lower case string. Currently qblast only works with blastn, blastp, blastx, tblast and tblastx.\n",
    " 2. The databases to search against, as a lower case string.\n",
    " 3. The query sequence as a string. This can either be the sequence itself, the sequence in fasta format, or an identifier like a GI number.\n",
    " \n",
    "Note that the default settings on the NCBI BLAST website are not quite the same as the defaults on QBLAST. If you get different results, you’ll need to check the parameters (e.g., the expectation value threshold and the gap values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.Blast import NCBIWWW\n",
    "result_handle = NCBIWWW.qblast(\"blastp\", \"nr\", patato_pep['PGSC0003DMP400040011'].seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Supplying just the sequence means that BLAST will assign an identifier for your sequence automatically. You might prefer to use the SeqRecord object’s format method to make a FASTA string (which will include the existing identifier):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_handle = NCBIWWW.qblast(\"blastp\", \"nr\", patato_pep['PGSC0003DMP400040011'].format(\"fasta\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing and reading\n",
    "Whatever arguments you give the qblast() function, you should get back your results in a handle object (by default in XML format). The next step would be to parse the XML output into Python objects representing the search results, but you might want to save a local copy of the output file first. \n",
    "\n",
    "To output the blast result we use the `read()` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_file = open(\"my_blast.xml\", \"w\")\n",
    "save_file.write(result_handle.read())\n",
    "save_file.close()\n",
    "os.system(\"head my_blast.xml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> \\<?xml version=\"1.0\"?\\>   \n",
    "> \\<!DOCTYPE BlastOutput PUBLIC \"-//NCBI//NCBI BlastOutput/EN\" \"http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd\">   \n",
    "> \\<BlastOutput\\>   \n",
    "> &nbsp;&nbsp;&nbsp;&nbsp;\\<BlastOutput_program\\>blastp\\</BlastOutput_program\\>   \n",
    "> &nbsp;&nbsp;&nbsp;&nbsp;\\<BlastOutput_version\\>BLASTP 2.3.1+\\</BlastOutput_version\\>   \n",
    "> &nbsp;&nbsp;&nbsp;&nbsp;\\<BlastOutput_reference\\>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.\\</BlastOutput_reference\\>   \n",
    ">  &nbsp;&nbsp;&nbsp;&nbsp;\\<BlastOutput_db\\>nr\\</BlastOutput_db\\>   \n",
    ">  &nbsp;&nbsp;&nbsp;&nbsp;\\<BlastOutput_query-ID\\>Query_56965\\</BlastOutput_query-ID\\>   \n",
    ">  &nbsp;&nbsp;&nbsp;&nbsp;\\<BlastOutput_query-def\\>**PGSC0003DMP400040011 PGSC0003DMT400059441 Protein**\\</BlastOutput_query-def\\>   \n",
    ">  &nbsp;&nbsp;&nbsp;&nbsp;\\<BlastOutput_query-len\\>222\\</BlastOutput_query-len\\>   \n",
    "\n",
    "\n",
    "We need to be a bit careful since we can use result_handle.read() to read the BLAST output only once – calling result_handle.read() again returns an either an empty string ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_handle.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ''\n",
    "\n",
    "After doing this, the results are in the file my_blast.xml and the original handle has had all its data extracted, so we can close it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_handle.close()\n",
    "result_handle.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Traceback (most recent call last):   \n",
    "> ...   \n",
    "> ValueError: I/O operation on closed file\n",
    "\n",
    "\n",
    "To acces our blast results we can read the save data using the `parse` function of the BLAST parser. `parse` takes a file-handle-like object as argument so we just need to traditionally open the blast result file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_handle = open(\"my_blast.xml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parsing BLAST output\n",
    "BLAST can generate output in various formats, such as XML, HTML, and plain text. Unfortunately, the BLAST output in HTML or plain text kept changing, each time breaking the original Biopython parsers. the HTML BLAST parser has been removed, but the plain text BLAST parser is still available but deprecated.\n",
    "\n",
    "Biopython recommend to parse the output in XML format, which can be generated by recent versions of BLAST. Not only is the XML output more stable than the plain text and HTML output, it is also much easier to parse automatically, making Biopython a whole lot more stable.\n",
    "\n",
    "We can get BLAST output in XML format in various ways (internet loccaly, etc...). For the parser, it doesn’t matter how the output was generated, as long as it is in the XML format.\n",
    "\n",
    "The specific sub-module `Blast.NCBIXML` allows to parse BLAST XMLs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.Blast import NCBIXML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just like the `SeqIO`object, we have a pair of input functions, \n",
    "\n",
    " - `.read()` for when you have exactly one object\n",
    " - `.parse()`to generate an iterator for when you can have lots of objects \n",
    "\n",
    "\n",
    "But instead of getting SeqRecord objects, we get BLAST record objects.\n",
    "\n",
    "We’ve got a handle, we are ready to parse the output. The code to parse it is really quite small if we expect a single BLAST result or if ywe want to parse only the first (next) results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "blast_record = NCBIXML.read(result_handle)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or, if you have lots of results you want to"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_handle.seek(0)\n",
    "blast_records = NCBIXML.parse(result_handle)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** In the example above why are we using the `.seek(0)` methods ?** [solution](solutions/_blast1.ipynb)\n",
    "\n",
    "Note though that you can step through the BLAST records only once. Usually, from each BLAST record you would save the information that you are interested in. If you want to save all returned BLAST records, you can convert the iterator into a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "blast_record_list = list(blast_records)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can access each BLAST record in the list with an index as usual.\n",
    "\n",
    "** What should we be careful of when storing the blast_records into a list ? ** [solution](solutions/_blast2.ipynb)\n",
    "\n",
    "I guess by now you’re wondering what is in a BLAST record !\n",
    "\n",
    "### The BLAST record class\n",
    "A BLAST Record contains everything you might ever want to extract from the BLAST output. Right now we’ll just show an example of how to get some info out of the BLAST report, but if you want something in particular that is not described here, look at the info on the [BLAST record class](http://biopython.org/DIST/docs/api/Bio.Blast.Record-module.html) in detail.\n",
    "\n",
    "A BLAST record contains a variety of informations in sub-Classes (partial list):  \n",
    "\n",
    "- Header - Saves information from a blast header.\n",
    "- Description - Stores information about one hit in the descriptions section.\n",
    "- Alignment - Stores information about one hit in the alignments section.\n",
    "- HSP - Stores information about one hsp in an alignment hit.\n",
    "- Parameters - Holds information about the parameters.\n",
    "- Blast - Saves the results from a blast search.\n",
    "\n",
    "\n",
    "Now let’s just print out some summary info about the hits in our blast report greater than a particular threshold (1e-35)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "E_VALUE_THRESH = 1e-35\n",
    "for alignment in blast_record.alignments:\n",
    "     for hsp in alignment.hsps:\n",
    "         if hsp.expect < E_VALUE_THRESH:\n",
    "             print '****Alignment****'\n",
    "             print 'sequence:' + alignment.title\n",
    "             print 'length:', alignment.length\n",
    "             print 'e value:', hsp.expect\n",
    "             print hsp.query[0:75] + '...'\n",
    "             print hsp.match[0:75] + '...'\n",
    "             print hsp.sbjct[0:75] + '...'\n",
    "            "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> \\*\\*\\*\\*Alignment\\*\\*\\*\\*   \n",
    "> sequence:gi|971553515|ref|XP_015164971.1| PREDICTED: uncharacterized protein LOC102590883 isoform X2 [Solanum tuberosum]   \n",
    "> length: 776   \n",
    "> e value: 2.54083e-38   \n",
    "> MECDTEDSEDNSNIQAD------------------SNHRLVKFVIPGNNL---LDQTK----SSSTKVVLIFLE-...   \n",
    "> ME D EDSED+SNIQAD                   N +L    I        L+Q +    +S+  +    LE ...   \n",
    "> MESDPEDSEDSSNIQADIQQPPSPEICNTGEQPPRPNKKLFNKGIVCRKFERHLEQNEEDVLTSAITLFAFSLEK...   \n",
    "> \\*\\*\\*\\*Alignment\\*\\*\\*\\*   \n",
    "> sequence:gi|565365084|ref|XP_006349247.1| PREDICTED: uncharacterized protein LOC102590883 isoform X1 [Solanum tuberosum]   \n",
    "> length: 792   \n",
    "> e value: 1.91084e-36   \n",
    "> SVEILTSVAKKVHMQLQNAEFHIQADMGRITSLNKLKRKHVEEVLQEKLQHLSSIYERCKEEVTRHLQDCKSTLQ...   \n",
    "> S EIL SVA+K+HMQLQNAEF IQADMGRITSLNK KRKHVEEVLQEK QHLS+IYER KEEVTRHLQDCKSTL+...   \n",
    "> SAEILNSVAEKIHMQLQNAEFQIQADMGRITSLNKSKRKHVEEVLQEKQQHLSAIYERFKEEVTRHLQDCKSTLE...   \n",
    "\n",
    "\n",
    "Basically, we can do anything we want to with the info in the BLAST report once we have parsed it. This will, of course, depend on what we want to use it for.\n",
    "\n",
    "[Here are is UML class diagrams for the Blast record class] (http://biopython.org/DIST/docs/tutorial/Tutorial.html#fig:blastrecord)\n",
    "\n",
    "[Here are is UML class diagrams for the PSIBlast class] (http://biopython.org/DIST/docs/tutorial/Tutorial.html#fig:psiblastrecord)\n",
    "\n",
    "## Multiple Alignment Sequences\n",
    "Multiple Sequence Alignments are a collection of multiple sequences which have been aligned together, usually with the insertion of gap characters, such that all the sequence strings are the same length. Alignment can be regarded as a matrix of letters, where each row is held as a SeqRecord object internally.\n",
    "\n",
    "the `MultipleSeqAlignment` object holds this kind of data, and the `AlignIO` module fis used for reading and writing them as various file formats.\n",
    "\n",
    "[The detail API of the `AlignIOt` module](http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html)\n",
    "\n",
    "In this workshop we won't cover the specific modules dedicated to command line wrappers for common multiple sequence alignment tools !\n",
    "\n",
    "### Parsing or Reading Sequence Alignments\n",
    "`AlignIO` contains 2 functions for reading in sequence alignments:\n",
    "\n",
    "- `read()` - will return a single `MultipleSeqAlignment` object\n",
    "- `parse()` - will return an iterator which gives `MultipleSeqAlignment` objects\n",
    "\n",
    "Both functions expect two mandatory arguments:\n",
    "\n",
    "- A string specifying a handle to an open file or a filename.\n",
    "- A lower case string specifying the alignment format. \n",
    "\n",
    "[See here for a full listing of supported formats](http://biopython.org/wiki/AlignIO)\n",
    "\n",
    "\n",
    "And 2 optional arguments:\n",
    "\n",
    "- The seq_count argument (integer) specifying the number of alignment in case of ambiguous file formats.\n",
    "- The alphabet argument allowing to specify the expected alphabet.\n",
    "\n",
    "#### Singles alignments\n",
    "Let's start with a single alignments file which contains the alignments of the 10 differents patato petitides we used previously (file __data/muscle-patato_pep.clw__). The alignments were generated using MUSCLE (MUltiple Sequence Comparison by Log-Expectation) which is a refeence for aligning amino-acide sequences and output in the CLUSTAL format.\n",
    "\n",
    "First let's look what is inside the file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.system(\"head data/muscle-patato_pep.clw\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> CLUSTAL multiple sequence alignment by MUSCLE (3.8)   \n",
    ">   \n",
    ">   \n",
    "> PGSC0003DMP400022612      ------------------------------------------------------------   \n",
    "> PGSC0003DMP400060824      ---------------MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRL--   \n",
    "> PGSC0003DMP400067339      ---------------MGVWKDSNYGKGVIIGVIDT--GILPDHPSFSDVGMPPPPAKWKG   \n",
    "> PGSC0003DMP400027454      MPHPTQVVALLKAQQIRHVRLFNADRGMLLALANTGIKVAVSVPNEQILGVGQSNTTAAN   \n",
    "> PGSC0003DMP400028584      ------------------MSTSVEPNGAVL--------------LDSTAGSGGGVANSNG   \n",
    "> PGSC0003DMP400030628      ------------------------------------------------------------   \n",
    "> PGSC0003DMP400020381      ------------------MLEKDSRDDRLDCVFPS---KHDKDSVEEVSSLSSENTRTSN   \n",
    "\n",
    "We can load this file as follows :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import AlignIO\n",
    "aln_patato = AlignIO.read(\"data/muscle-patato_pep.clw\", \"clustal\")\n",
    "print aln_patato"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SingleLetterAlphabet() alignment with 10 rows and 1294 columns   \n",
    "> --------------------------------------------...--- PGSC0003DMP400022612   \n",
    "> ---------------MQIFVKTLTGKTITLEVESSDTIDNVKAK...GGF PGSC0003DMP400060824   \n",
    "> ---------------MGVWKDSNYGKGVIIGVIDT--GILPDHP...LDK PGSC0003DMP400067339   \n",
    "> MPHPTQVVALLKAQQIRHVRLFNADRGMLLALANTGIKVAVSVP...TSS PGSC0003DMP400027454   \n",
    "> ------------------MSTSVEPNGAVL--------------...--- PGSC0003DMP400028584   \n",
    "> --------------------------------------------...--- PGSC0003DMP400030628   \n",
    "> ------------------MLEKDSRDDRLDCVFPS---KHDKDS...--- PGSC0003DMP400020381   \n",
    "> --------------------------------------------...--- PGSC0003DMP400040011   \n",
    "> MEIGLAVGGAFLSSALNVLFDRLAPHGDLLNMFQKHTDDVQLFE...YL- PGSC0003DMP400032361   \n",
    "> --------------------------------------------...--- PGSC0003DMP400037883   \n",
    "\n",
    "Note in the above output the sequences have been truncated. We could instead write our own code to format this as we please by iterating over the rows as `SeqRecord` objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for record in aln_patato:\n",
    "   print(\"%s - %s\" % (record.seq[1:60], record.id))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ----------------------------------------------------------- - PGSC0003DMP400022612   \n",
    "> --------------MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRL-- - PGSC0003DMP400060824   \n",
    "> --------------MGVWKDSNYGKGVIIGVIDT--GILPDHPSFSDVGMPPPPAKWKG - PGSC0003DMP400067339   \n",
    "> PHPTQVVALLKAQQIRHVRLFNADRGMLLALANTGIKVAVSVPNEQILGVGQSNTTAAN - PGSC0003DMP400027454   \n",
    "> -----------------MSTSVEPNGAVL--------------LDSTAGSGGGVANSNG - PGSC0003DMP400028584   \n",
    "> ----------------------------------------------------------- - PGSC0003DMP400030628   \n",
    "> -----------------MLEKDSRDDRLDCVFPS---KHDKDSVEEVSSLSSENTRTSN - PGSC0003DMP400020381   \n",
    "> ----------------------------------------------------------- - PGSC0003DMP400040011   \n",
    "> EIGLAVGGAFLSSALNVLFDRLAPHGDLLNMFQKHTDDVQLFEKLGDILLGLQIVLSDA - PGSC0003DMP400032361   \n",
    "> ----------------------------------------------------------- - PGSC0003DMP400037883   \n",
    "\n",
    "With any supported file format, we can load an alignment in exactly the same way just by changing the format string. For example, use “phylip” for PHYLIP files, “nexus” for NEXUS files or “emboss” for the alignments output by the EMBOSS tools.\n",
    "\n",
    "\n",
    "#### Multiple Alignments\n",
    "In general alignments files can contain multiples alignment, and to read these files we must use the `AlignIO.parse()` function.\n",
    "\n",
    "Suppose you have a multiple alignments file in PHYLIP format (dummy alignments) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.system(\"head data/dummy_aln.phy\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> &nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;6   \n",
    "> Alpha     AAACCA   \n",
    "> Beta      AAACCC   \n",
    "> Gamma     ACCCCA   \n",
    "> Delta     CCCAAC   \n",
    "> Epsilon   CCCAAA   \n",
    "> &nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;6   \n",
    "> Alpha     AAACAA   \n",
    "> Beta      AAACCC   \n",
    "> Gamma     ACCCAA   \n",
    "\n",
    "If we wanted to read this in using `AlignIO` we could use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aln_dummy = AlignIO.parse(\"data/dummy_aln.phy\", \"phylip\")\n",
    "for alignment in aln_dummy:\n",
    "    print alignment\n",
    "    print \"\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SingleLetterAlphabet() alignment with 5 rows and 6 columns   \n",
    "> AAACCA Alpha   \n",
    "> AAACCC Beta   \n",
    "> ACCCCA Gamma   \n",
    "> CCCAAC Delta   \n",
    "> CCCAAA Epsilon   \n",
    ">    \n",
    "> ...   \n",
    "> SingleLetterAlphabet() alignment with 5 rows and 6 columns   \n",
    "> AAAAAC Alpha   \n",
    "> AAACCC Beta   \n",
    "> AACAAC Gamma   \n",
    "> CCCCCA Delta   \n",
    "> CCCAAC Epsilon   \n",
    ">    \n",
    "\n",
    "\n",
    "The `.parse()` function returns an iterator. If we want to keep all the alignments in memory at once, then we need to turn the iterator into a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alignments = list(AlignIO.parse(\"data/dummy_aln.phy\", \"phylip\"))\n",
    "second_aln = alignments[1]\n",
    "print second_aln\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> SingleLetterAlphabet() alignment with 5 rows and 6 columns   \n",
    "> AAACAA Alpha   \n",
    "> AAACCC Beta   \n",
    "> ACCCAA Gamma   \n",
    "> CCCACC Delta   \n",
    "> CCCAAA Epsilon   \n",
    "\n",
    "\n",
    "### Writing Alignments\n",
    "Now we’ll look at `AlignIO.write()` which is for alignments output (writing files). \n",
    "\n",
    "This function takes 3 arguments: \n",
    "- Some MultipleSeqAlignment objects \n",
    "- A string specifying a handle or a filename to write to\n",
    "- A lower case string specifying the sequence format.\n",
    "\n",
    "We start by creating a MultipleSeqAlignment object the hard way (by hand). Note we create some SeqRecord objects to construct the alignment from."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.Alphabet import generic_dna\n",
    "from Bio.Align import MultipleSeqAlignment\n",
    "align1 = MultipleSeqAlignment([\n",
    "    SeqRecord(Seq(\"ACTGCTAGCTAG\", generic_dna), id=\"toto\"),\n",
    "    SeqRecord(Seq(\"ACT-CTAGCTAG\", generic_dna), id=\"titi\"),\n",
    "    SeqRecord(Seq(\"ACTGCTAGDTAG\", generic_dna), id=\"tata\"),\n",
    "])\n",
    "\n",
    "print align1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> DNAAlphabet() alignment with 3 rows and 12 columns   \n",
    "> ACTGCTAGCTAG toto   \n",
    "> ACT-CTAGCTAG titi   \n",
    "> ACTGCTAGDTAG tata   \n",
    "\n",
    "Now let's try to output, in phylip format, these alignments in a file with patato peptide alignments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_alignments = [align1, aln_patato]\n",
    "AlignIO.write(my_alignments, \"mixed.phy\", \"phylip\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Traceback (most recent call last):\n",
    ">  ...\n",
    "\n",
    "\n",
    "**What is the source of the error ?** [solution](solutions/_align1.ipynb)\n",
    "\n",
    "**How can we resolve it ?** [solution](solutions/_align2.ipynb)\n",
    "\n",
    "\n",
    "Note - As for `SeqIO.wrtie()`, if you tell the `AlignIO.write()` function to write to a file that already exists, the old file will be overwritten without any warning.\n",
    "\n",
    "\n",
    "###### Exercice\n",
    "\n",
    "**Could you write an alignment converter function (reading,writing) ?** [solution](solutions/_align3.ipynb)\n",
    "\n",
    "\n",
    "\n",
    "# The Other interesting module of Biopython Object\n",
    "During this workshop we only scratch the surface of the full potential of Biopython. Many other interesting modules are available, between others:\n",
    "\n",
    "|module|task|\n",
    "|---|---|\n",
    "|Bio.Aff|Deal with Affymetrix related data such as cel files|\n",
    "|Bio.Emboss|Code to interact with the EMBOSS programs|\n",
    "|Bio.Entrez|Provides code to access NCBI over the WWW|\n",
    "|Bio.HMM|A selection of Hidden Markov Model code|\n",
    "|Bio.KDTree|KD tree data structure for searching N-dimensional vectors|\n",
    "|Bio.KEGG|This module provides code to work with data from the KEGG database|\n",
    "|Bio.PDB|Classes that deal with macromolecular crystal structures|\n",
    "|Bio.Pathway|BioPython Pathway module|\n",
    "|Bio.Phylo|Package for working with phylogenetic trees|\n",
    "|Bio.PopGen|PopGen: Population Genetics and Genomics library in Python|\n",
    "|Bio.SeqUtils|Miscellaneous functions for dealing with sequences|\n",
    "|...||\n",
    "\n",
    "**I strongly encourage you to look at he full list of module and to explore those you found relevent with your own research.**\n",
    "\n",
    "# Homework\n",
    " \n",
    " You will find a file called [sequence.gbk](data/sequence.gbk) which contain the sequence of Mystery cRNA. The aim of the excerice is to create a script that will take in input any gbk file of a cDNA and do the following:  \n",
    "- Extract the DNA sequence\n",
    "- Translate it into a RNA sequence\n",
    "- Transcript it into a protein sequence\n",
    "- Blast the protein sequence against the nr database\n",
    "- Retreive the name of the gene (if possible).\n",
    "- output every sequence in a specific file using the name of the gene (or unknown if not possible)\n",
    "\n",
    "You should try to comment your code\n",
    "You should try to separated each step in a specific function or method\n",
    "\n",
    "**There are not one unique solution to do that ! Be creative ! I will add my own implentation soon**\n",
    "\n",
    "# Aknowledgments\n",
    "This tutorial is an adaptation of some part of the one Biopython Tutorial avaialable on-line. I would like to thank the Biopytho team for their fabulous work."
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 0
}
